NVIDIA-596: Enable dpu healthcheck #2941
Conversation
|
@tsorya: This pull request references NVIDIA-596 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the sub-task to target the "4.22.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
1 similar comment
|
@tsorya: This pull request references NVIDIA-596 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the sub-task to target the "4.22.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdd DPU node lease configuration: new bootstrap fields and defaults, parse/validate ConfigMap keys, expose string values to templates, inject env vars into ovnkube DaemonSet for DPU modes, pass lease flags from node script to ovnkube, add ConfigMap defaults and tests. Changes
Sequence Diagram(s)sequenceDiagram
participant ConfigMap as hardware-offload ConfigMap
participant Bootstrap as bootstrapOVNConfig
participant Renderer as template renderer
participant K8sAPI as Kubernetes API (DaemonSet)
participant NodeScript as ovnkube node script
participant ovnkube as ovnkube process
ConfigMap->>Bootstrap: read dpu-node-lease-* keys
Bootstrap->>Bootstrap: parse & validate values
Bootstrap->>Renderer: provide DpuNodeLeaseRenewInterval/DpuNodeLeaseDuration (strings)
Renderer->>K8sAPI: create/update DaemonSet with env vars (conditional on node mode)
K8sAPI->>NodeScript: schedule/run ovnkube node script (on node)
NodeScript->>NodeScript: build dpu_lease_flags from env vars
NodeScript->>ovnkube: invoke ovnkube with ${dpu_lease_flags}
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 8 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (8 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
62c31b1 to
b5a3d66
Compare
|
/retest-required |
|
Blocked by k8snetworkplumbingwg/multus-cni#1490 |
|
@tsorya Could you please help rebase this PR, then I can build an image to run some pre-merge testing. |
1eb0381 to
6b9ed3a
Compare
done |
|
@yingwang-0320: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
| daemon-config.json: | | ||
| { | ||
| "cniVersion": "0.3.1", | ||
| "cniVersion": "1.1.0", |
There was a problem hiding this comment.
Multus was updated to the new version which enables CNI status
There was a problem hiding this comment.
UPdated PR description
|
@tsorya: This pull request references NVIDIA-596 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the sub-task to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/network/ovn_kubernetes.go`:
- Around line 1094-1102: If either ovnConfigResult.DpuNodeLeaseRenewInterval or
ovnConfigResult.DpuNodeLeaseDuration is 0 we should normalize both to 0 so the
disable semantics are consistent; update the logic around the current checks to
first detect if either field == 0 and set both
ovnConfigResult.DpuNodeLeaseRenewInterval = 0 and
ovnConfigResult.DpuNodeLeaseDuration = 0, otherwise keep the existing validation
that when both are non-zero and DpuNodeLeaseDuration <=
DpuNodeLeaseRenewInterval you log the warning and reset to
DPU_NODE_LEASE_RENEW_INTERVAL_DEFAULT and DPU_NODE_LEASE_DURATION_DEFAULT.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 9e7b87eb-40e5-48db-92c5-608250f639d9
📒 Files selected for processing (3)
bindata/network/ovn-kubernetes/common/008-script-lib.yamlhack/hardware-offload-config.yamlpkg/network/ovn_kubernetes.go
✅ Files skipped from review due to trivial changes (1)
- hack/hardware-offload-config.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
- bindata/network/ovn-kubernetes/common/008-script-lib.yaml
|
7db199b to
556c81f
Compare
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: danwinship, tsorya The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
LGTM! |
|
/retest-required |
|
/test? |
|
/retest |
|
/payload 4.22 ci blocking |
|
@tsorya: trigger 5 job(s) of type blocking for the ci release of OCP 4.22
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/80e8be70-42fa-11f1-9152-f48b1625dff6-0 trigger 13 job(s) of type blocking for the nightly release of OCP 4.22
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/80e8be70-42fa-11f1-9152-f48b1625dff6-1 |
|
/payload 4.22 ci blocking |
|
@tsorya: trigger 5 job(s) of type blocking for the ci release of OCP 4.22
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/d32851d0-4347-11f1-8bbc-ea530d32dce6-0 trigger 13 job(s) of type blocking for the nightly release of OCP 4.22
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/d32851d0-4347-11f1-8bbc-ea530d32dce6-1 |
|
/payload 4.22 ci blocking |
|
@tsorya: trigger 5 job(s) of type blocking for the ci release of OCP 4.22
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/46a6e790-45d5-11f1-9a42-2e6691300348-0 trigger 13 job(s) of type blocking for the nightly release of OCP 4.22
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/46a6e790-45d5-11f1-9a42-2e6691300348-1 |
|
/payload 4.22 ci blocking |
|
@tsorya: trigger 5 job(s) of type blocking for the ci release of OCP 4.22
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/8d512770-4650-11f1-9e5b-4564d59cf248-0 trigger 13 job(s) of type blocking for the nightly release of OCP 4.22
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/8d512770-4650-11f1-9e5b-4564d59cf248-1 |
|
Latest run
{ fail [k8s.io/kubernetes/test/e2e/apimachinery/crd_validation_rules.go:157]: expect error contains "failed rule", got "the server could not find the requested resource"} pod "network-check-target-5pbpt" is using the default service account
|
|
/verified by @tsorya |
|
@tsorya: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
e8d6ace to
6401033
Compare
|
New changes are detected. LGTM label has been removed. |
Add configurable DPU node lease health monitoring to detect when the DPU-side OVN-Kubernetes component is down or not installed. Without this, pods are scheduled to DPU-accelerated nodes regardless of DPU readiness, causing silent 2-minute CNI ADD timeouts with no visibility or automated remediation. DPU lease configuration: - Read dpu-node-lease-renew-interval and dpu-node-lease-duration from the hardware-offload-config ConfigMap (defaults: 10s / 40s). - Inject OVNKUBE_NODE_LEASE_RENEW_INTERVAL and OVNKUBE_NODE_LEASE_DURATION env vars into ovnkube-controller for dpu-host/dpu node modes. - Script-lib translates env vars into --dpu-node-lease-renew-interval and --dpu-node-lease-duration CLI flags for ovnkube-node. - Setting renew-interval to 0 disables the health check; duration must always be > 0 (required by ovn-kubernetes). - Lease namespace is derived via downward API (fieldRef). Jira: https://issues.redhat.com/browse/NVIDIA-596 Made-with: Cursor Signed-off-by: Igal Tsoiref <[email protected]> Co-authored-by: Cursor <[email protected]>
Signed-off-by: Igal Tsoiref <[email protected]> Co-authored-by: Cursor <[email protected]>
6401033 to
a5d3f19
Compare
|
@tsorya: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
NVIDIA-596: Enable DPU healthcheck and bump Multus CNI to 1.1.0
Add configurable DPU node lease health monitoring to detect when the
DPU-side OVN-Kubernetes component is down or not installed. Without
this, pods are scheduled to DPU-accelerated nodes regardless of DPU
readiness, causing silent 2-minute CNI ADD timeouts with no visibility
or automated remediation.
DPU lease configuration:
the hardware-offload-config ConfigMap (defaults: 10s / 40s).
env vars into ovnkube-controller for dpu-host/dpu node modes.
and --dpu-node-lease-duration CLI flags for ovnkube-node.
normalized to 0 when either is 0.
Bump Multus CNI API version to 1.1.0:
health check to report NetworkReady=false when the DPU lease expires.
operations are unchanged.
Made-with: Cursor
Jira: https://issues.redhat.com/browse/NVIDIA-596
Summary by CodeRabbit
New Features
Tests